66 research outputs found

    UPM-UC3M system for music and speech segmentation

    This paper describes the UPM-UC3M system for the Albayzín 2010 evaluation on Audio Segmentation. The task consists of segmenting a broadcast news audio document into four classes: clean speech, music, speech with background noise and speech with background music. The UPM-UC3M system is based on Hidden Markov Models (HMMs), with a 3-state HMM for every acoustic class. The number of states and the number of Gaussians per state have been tuned for this evaluation. System development focused mainly on feature selection. Two architectures have also been tested: a one-step system and a hierarchical system in which different features are used for segmenting the different audio classes. For both systems, we have considered long-term statistics of MFCC (Mel Frequency Cepstral Coefficients), spectral entropy and CHROMA coefficients. The best configuration of the one-step system obtained a 25.3% average error rate and an 18.7% diarization error (using the NIST tool); the best configuration of the hierarchical system obtained a 23.9% average error rate and a 17.9% diarization error
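As a rough illustration of two of the features named in this abstract, the sketch below computes the spectral entropy of one frame and long-term statistics (mean and standard deviation over a sliding window) of a per-frame feature. The function names, the window and hop parameters, and the use of base-2 logarithms are assumptions made for the example; the abstract does not give the system's actual configuration.

```python
import math
from statistics import fmean, pstdev


def spectral_entropy(power_spectrum):
    """Shannon entropy (in bits) of a power spectrum normalized to sum to 1."""
    total = sum(power_spectrum)
    if total <= 0:
        return 0.0
    probs = [p / total for p in power_spectrum if p > 0]
    return -sum(p * math.log2(p) for p in probs)


def long_term_stats(values, window, hop=1):
    """Mean and population std of a per-frame feature over sliding windows.

    `window` and `hop` are counted in frames (hypothetical parameters).
    """
    stats = []
    for start in range(0, len(values) - window + 1, hop):
        chunk = values[start:start + window]
        stats.append((fmean(chunk), pstdev(chunk)))
    return stats
```

A flat spectrum gives the maximum entropy for its number of bins, while a spectrum with a single non-zero bin gives zero, which is why entropy is a plausible cue for separating noise-like from tonal content.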

    Robust Speech Detection for Noisy Environments

    This paper presents a robust voice activity detector (VAD) based on hidden Markov models (HMMs) to improve speech recognition systems in stationary and non-stationary noise environments: inside motor vehicles (such as cars or planes) or inside buildings close to high-traffic places (such as a control tower for air traffic control (ATC)). In these environments, there is a high stationary noise level caused by vehicle motors and, additionally, there may be people speaking at some distance from the main speaker, producing non-stationary noise. The VAD presented in this paper is characterized by a new front-end and a noise level adaptation process that significantly increases the VAD robustness for different signal-to-noise ratios (SNRs). The feature vector used by the VAD includes the most relevant Mel Frequency Cepstral Coefficients (MFCC), normalized log energy and delta log energy. The proposed VAD has been evaluated and compared to other well-known VADs using three databases containing different noise conditions: speech in clean environments (SNRs greater than 20 dB), speech recorded in stationary noise environments (inside or close to motor vehicles) and, finally, speech in non-stationary environments (including noise from bars, television and far-field speakers). In all three cases, the detection error obtained with the proposed VAD is the lowest for all SNRs compared to Acero's VAD (the reference for this work) and other well-known VADs such as AMR, AURORA or G729 annex b
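The energy part of the feature vector described above can be sketched as follows. This is a minimal example under stated assumptions: log energy is normalized by subtracting the maximum over the utterance, and the delta is a two-frame central difference with the edges clamped. The abstract does not say which normalization or delta window the authors used, and all function names are hypothetical.

```python
import math


def log_energies(frames, eps=1e-10):
    """Natural-log energy of each frame (a frame is a list of samples)."""
    return [math.log(sum(s * s for s in frame) + eps) for frame in frames]


def normalize(energies):
    """Shift log energies so the loudest frame sits at 0 (assumed scheme)."""
    peak = max(energies)
    return [e - peak for e in energies]


def deltas(energies):
    """Central difference of the log-energy track, clamped at both ends."""
    n = len(energies)
    return [
        (energies[min(t + 1, n - 1)] - energies[max(t - 1, 0)]) / 2.0
        for t in range(n)
    ]
```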

    Mejora de servicios por teléfono con reconocimiento de habla. Nueva generación de servidores vocales interactivos

    This work analyzes and investigates three important aspects of an Interactive Voice Response (IVR) system: automatic speech recognition, confidence measures for detecting errors in the speech recognition and natural language understanding modules, and dialogue management, to which a substantial effort has been devoted. For the recognition module, the spelling task in Spanish has been studied and the first recognizer of spelled names in Spanish has been implemented, with accuracy rates comparable to those reported for other languages. As a first step, different recognition strategies were evaluated, and a solution based on a hypothesis-verification architecture was chosen because it offers a better trade-off between recognition rate and processing time. New ideas have been added to this architecture to deal with the peculiarities of the spelling task in Spanish, such as the generation of contextual silence models. In addition, a continuous speech recognizer has been developed for sentences expressing dates and times. Both systems have been designed and trained to operate over the telephone line and to be speaker independent. Regarding the analysis of confidence measures, the work has focused mainly on the DARPA Communicator system developed at the Center for Spoken Language Research (CSLR) of the University of Colorado (Boulder) in the United States. On this system, independent studies have been carried out at the word, semantic concept and whole-sentence levels. Analyses have also been carried out for the recognizers implemented in this thesis, focusing on the sentence level for the spelled-name system and on the word level for the recognizer developed for the date and time domain. This part of the study proposes using confidence measures as a heuristic for combining several recognition hypotheses obtained from different decoders. Regarding dialogue management, a design methodology is proposed that combines information from different sources: database analysis, observation of real conversations, service simulation and operation with real users. The methodology consists of 5 phases. In the first phase, the database containing the information available for offering the service is analyzed. In the second phase, "design by intuition", brainstorming is proposed as a technique for raising different design options. In design by observation (third phase), conversations between users and human operators are analyzed to evaluate different design alternatives. In the fourth phase (design by simulation), the Wizard of Oz tool is used to simulate a user-system interaction. Finally, the iterative improvement phase describes the use of confidence measures for designing the confirmation mechanisms, together with a user-modeling technique based on skill levels. The methodology is presented through its application to a train information and ticket reservation service

    UPM system for the translation task

    This paper describes the UPM system for the translation task at the EMNLP 2011 workshop on statistical machine translation (http://www.statmt.org/wmt11/). The system has been used for both directions: Spanish-English and English-Spanish. It is based on Moses, with two new modules for pre- and post-processing the sentences. The main contribution is the proposed method (based on similarity with the source language test set) for selecting the sentences used to train the models and adjust the weights. With this system, we have obtained a 23.2 BLEU for Spanish-English and 21.7 BLEU for English-Spanish
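The idea of selecting training sentences by their similarity to the source side of the test set can be sketched as below. The similarity measure here (the fraction of a sentence's word tokens that also occur in the test set) is an assumption chosen for brevity; the abstract does not specify the measure used in the UPM system, and `select_similar` is a hypothetical name.

```python
def select_similar(train_sentences, test_sentences, k):
    """Return the k training sentences most similar to the test set.

    Similarity is the share of a sentence's tokens found in the test-set
    vocabulary (an illustrative measure, not the published one).
    """
    test_vocab = {w for sent in test_sentences for w in sent.lower().split()}

    def score(sentence):
        words = sentence.lower().split()
        if not words:
            return 0.0
        return sum(w in test_vocab for w in words) / len(words)

    # sorted() is stable, so ties keep their original corpus order.
    ranked = sorted(train_sentences, key=score, reverse=True)
    return ranked[:k]
```

For instance, with the test set `["the dog sat"]`, the sentence "the dog sat down" (3 of 4 tokens covered) ranks above "the cat sat" (2 of 3), and "stock markets fell" is ranked last.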

    Combining pulse-based features for rejecting far-field speech in a HMM-based Voice Activity Detector. Computers & Electrical Engineering (CAEE).

    Several computational techniques for speech recognition have been proposed in recent years. These techniques represent an important improvement in real-time applications where a speaker interacts with a speech recognition system. Although researchers have proposed many methods, none of them solves the high false alarm problem that arises when far-field speakers interfere in a human-machine conversation. This paper presents a two-class (speech and non-speech) decision-tree-based approach for combining new speech pulse features in a VAD (Voice Activity Detector) for rejecting far-field speech in speech recognition systems. The decision tree is applied to the speech pulses obtained by a baseline VAD composed of a frame feature extractor, an HMM-based (Hidden Markov Model) segmentation module and a pulse detector. The paper also presents a detailed analysis of a large number of features for discriminating between close and far-field speech. The detection error obtained with the proposed VAD is the lowest compared to other well-known VADs
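To make the two-class decision concrete, the sketch below trains the simplest possible decision tree, a depth-1 stump, on per-pulse feature vectors. It is only an illustration: the abstract does not list the pulse features or the tree-induction method, so the feature values, the label coding (1 for close-talk speech, 0 for far-field speech) and both function names are hypothetical.

```python
def train_stump(samples, labels):
    """Fit a depth-1 decision tree by exhaustive search.

    Returns (feature_index, threshold, label_if_at_or_above) with the
    fewest training errors. Labels are assumed to be 0 or 1.
    """
    best = None
    for j in range(len(samples[0])):
        for thr in sorted({s[j] for s in samples}):
            for above in (0, 1):
                errors = sum(
                    (above if s[j] >= thr else 1 - above) != y
                    for s, y in zip(samples, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, thr, above)
    return best[1:]


def predict(stump, sample):
    """Classify one pulse feature vector with a trained stump."""
    j, thr, above = stump
    return above if sample[j] >= thr else 1 - above
```

A full decision tree applies the same split search recursively to each resulting subset; the stump is shown only because it fits in a few lines.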

    Review of Research on Speech Technology: Main Contributions From Spanish Research Groups

    In the last two decades, research on speech technology in Spain has increased considerably, mainly due to a higher level of funding from European, Spanish and local institutions and to a growing interest in these technologies for developing new services and applications. This paper reviews the main areas of speech technology addressed by research groups in Spain, their main contributions in recent years and their current focus of interest. The description is organized into five main areas: audio processing including speech, speaker characterization, speech and language processing, text-to-speech conversion and spoken language applications. The paper also introduces the Spanish Network of Speech Technologies (RTTH, Red Temática en Tecnologías del Habla) as the research network that includes almost all the researchers working in this area, presenting some figures, its objectives and the main activities it has carried out in recent years

    UPM system for WMT 2012

    This paper describes the UPM system for the Spanish-English translation task at the NAACL 2012 workshop on statistical machine translation. The system is based on Moses. We have used all available free corpora, cleaning them and deleting some repetitions. We also propose a technique for selecting the sentences used for tuning the system, based on their similarity with the sentences to translate. With our approach, we improve the BLEU score from 28.37% to 28.57%. In the WMT12 challenge we obtained a 31.80% BLEU on the 2012 test set. Finally, we explain different experiments carried out after the competition

    Combining pulse-based features for rejecting far-field speech in a HMM-based Voice Activity Detector

    The advantages of using Automatic Speech Recognition are obvious for several types of applications. Speech recognition becomes difficult when the main speaker is in a noisy environment, for example a bar, where many far-field speakers are speaking almost all the time. This reduces the speech recognizer success rate, which can lead to an unsatisfactory experience for the user. If there are too many recognition mistakes, the user is forced to correct the system, which takes too long and is a nuisance, and the user will finally reject the system. To solve this problem, a Robust Voice Activity Detector is proposed in this work. The VAD selects speech frames and discards noise frames. This frame information is sent to the Speech Recognizer so that only speech is processed, and the VAD thereby tries to avoid Speech Recognizer mistakes coming from noisy frames. If the VAD works well, the Speech Recognizer does too. In summary, in mobile phone scenarios it is very common for the target speaker to be in an open environment surrounded by far-field interfering speech from other speakers. In this ambiguous case, VAD systems can detect far-field speech as coming from the user, increasing the speech recognition error rate. Generally, detection errors caused by background voices mainly increase word insertions and substitutions, leading to significant dialogue misunderstandings. This work tries to solve these problems in speech-based applications, in which far-field speech can be wrongly considered as main speaker speech. In [1] a spectrum sensing scheme to detect the presence of the primary user for cognitive radio systems is proposed (very similar to the VAD proposed in this paper) being able to distinguish between main speaker speech and far-field speech. 
Moreover the system implemented in [...] In several previous works, measurements similar to those considered in this work have been used for dereverberation techniques.

    Clustering of syntactic and discursive information for the dynamic adaptation of Language Models

    We present a strategy for clustering semantic and discursive dialogue elements. Using Latent Semantic Analysis (LSA), we group the different elements according to a correlation-based distance criterion. After selecting a set of clusters that form a partition of the semantic or discursive space under consideration, we train stochastic language models (LMs) associated with each cluster. These models are used for the dynamic adaptation of the language model employed by the speech recognizer of a dialogue system. Using dialogue information (the posterior probabilities that the dialogue manager assigns to each dialogue element at each turn), we estimate the interpolation weights for each LM. Initial experiments show a reduction in word error rate when the information obtained from a sentence is used to re-estimate that same sentence
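The adaptation step can be pictured as a linear interpolation of cluster-specific language models, P(w) = Σ_k λ_k P_k(w), with the weights λ_k derived from the dialogue manager's posteriors. The sketch below assumes unigram LMs stored as dictionaries and weights obtained by simply normalizing one posterior mass per cluster; both are simplifications for illustration, not the estimation procedure of the paper, and the function names are hypothetical.

```python
def weights_from_posteriors(cluster_posteriors):
    """Normalize per-cluster posterior mass into interpolation weights."""
    total = sum(cluster_posteriors)
    if total <= 0:
        # No evidence for any cluster: fall back to uniform weights.
        return [1.0 / len(cluster_posteriors)] * len(cluster_posteriors)
    return [p / total for p in cluster_posteriors]


def interpolated_prob(word, cluster_lms, weights):
    """P(word) under the linear interpolation of the cluster LMs."""
    return sum(w * lm.get(word, 0.0) for lm, w in zip(cluster_lms, weights))
```

With two toy LMs, `{"tren": 0.6, "hora": 0.4}` and `{"tren": 0.1, "hora": 0.9}`, and posterior masses 3.0 and 1.0, the weights are 0.75 and 0.25, so the adapted probability of "tren" is 0.75 × 0.6 + 0.25 × 0.1 = 0.475.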

    Spanish generation from Spanish Sign Language using a phrase-based translation system

    This paper describes the development of a Spoken Spanish generator from Spanish Sign Language (LSE – Lengua de Signos Española) in a specific domain: the renewal of Identity Document and Driver’s license. The system is composed of three modules. The first one is an interface where a deaf person can specify a sign sequence in sign-writing. The second one is a language translator for converting the sign sequence into a word sequence. Finally, the last module is a text to speech converter. Also, the paper describes the generation of a parallel corpus for the system development composed of more than 4,000 Spanish sentences and their LSE translations in the application domain. The paper is focused on the translation module that uses a statistical strategy with a phrase-based translation model, and this paper analyses the effect of the alignment configuration used during the process of word based translation model generation. Finally, the best configuration gives a 3.90% mWER and a 0.9645 BLEU
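For readers unfamiliar with the mWER figure quoted above, the sketch below computes a word error rate from the word-level edit distance and extends it to several references. Taking the minimum WER over the available references is one common reading of multi-reference WER and is an assumption here; the abstract does not define the metric, and the function names are hypothetical.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(
                prev[j] + 1,             # extra word in the hypothesis
                cur[j - 1] + 1,          # reference word missing
                prev[j - 1] + (h != r),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(hypothesis, reference):
    """Edit distance divided by the reference length (non-empty reference)."""
    hyp, ref = hypothesis.split(), reference.split()
    return edit_distance(hyp, ref) / len(ref)


def mwer(hypothesis, references):
    """Lowest WER over all references (assumed definition of mWER)."""
    return min(wer(hypothesis, ref) for ref in references)
```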